Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python 3 by Artem Kovera
Author:Artem Kovera [Kovera, Artem]
Language: eng
Format: azw3
Published: 2017-10-21T04:00:00+00:00
The disadvantages of k-means and methods to overcome them
Although the k-means algorithm has some important advantages over other clustering methods, it also has a number of drawbacks.
The first disadvantage of this algorithm is that, depending on the initial positions of the centroids, we can get dramatically different results at the end. So, the k-means algorithm is not deterministic. Also, there is no guarantee that the algorithm will converge to the global optimum. We can partially get around these problems by running the algorithm multiple times and picking the output with the smallest within-cluster variance. Another solution to this problem is to use k-means++ initialization. It is best to use both of these methods together, as we just did with Scikit-learn's k-means.
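For reference, here is a minimal sketch of how those two remedies look in Scikit-learn; the synthetic dataset and the parameter values are assumptions chosen only for illustration, not the book's own code:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only (assumption, not from the book)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# init='k-means++' spreads the initial centroids apart;
# n_init=10 reruns the whole algorithm 10 times and keeps the run
# with the lowest within-cluster sum of squares (inertia_).
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.inertia_)  # sum of squared distances of points to their nearest centroid

Combining the two remedies is cheap: each restart is independent, so ten restarts simply multiply the running time by ten while greatly reducing the chance of a poor local optimum.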
The second and probably most important disadvantage of the k-means algorithm is that we have to specify the number of clusters in advance.
Having some a priori knowledge about the problem can help choose k, the number of clusters. Sometimes we do have such knowledge. For example, when clustering astrophysical images, it is known beforehand that the two brightest types of objects in the cosmos are galaxies and quasars, so we can set the number of clusters to two.
When we don’t have enough prior knowledge about the problem, we need to search for a good k. We can run the algorithm for different values of k and compare the variances of the resulting clusters. But the more clusters we use, the lower the variance becomes, so by this criterion the “best” number of clusters would equal the number of data points, which of course makes no sense.
Instead of simply choosing the result with the lowest variance, we can use the Elbow method. In this method, we run the k-means algorithm for different values of k and track, for each k, the sum of squared errors (or, equivalently, the ratio of the between-group variance to the total variance). We should choose a number of clusters such that adding one more cluster no longer gives a substantial improvement in this quantity.
In the following example of the Elbow method, we use the k-means algorithm from the Scikit-learn library and the cdist function from the SciPy library for distance computation:
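The book's original listing is not reproduced in this excerpt; the code below is a minimal sketch of what such an Elbow-method loop could look like, assuming a synthetic make_blobs dataset and a range of k values chosen only for illustration:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only (assumption, not from the book)
X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

ks = range(1, 10)
avg_distortions = []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    # cdist returns the distance from every point to every centroid;
    # the minimum along axis=1 is the distance to the nearest centroid.
    d = np.min(cdist(X, km.cluster_centers_, 'euclidean'), axis=1)
    avg_distortions.append(d.mean())

plt.plot(ks, avg_distortions, 'bo-')
plt.xlabel('Number of clusters k')
plt.ylabel('Average distortion')
plt.title('Elbow method')
plt.show()

The average distortion keeps decreasing as k grows, but the decrease flattens out after a certain point; that bend, the "elbow" of the curve, indicates the value of k to pick.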